The objective of this project is to implement data cleaning, visualizaiton and analysis on a real set of data and to familiarize ourselves with the process.
For this project, I use the “County-level Oil and Gas Production in the U.S.” as my data set. A brief preview of the data set is shown as the following:
data1<-read.xls("oilgascounty.xls", sheet=1)
head(data1)
## FIPS geoid Stabr County_Name Rural_Urban_Continuum_Code_2013
## 1 1001 1001 AL Autauga County 2
## 2 1003 1003 AL Baldwin County 3
## 3 1005 1005 AL Barbour County 6
## 4 1007 1007 AL Bibb County 1
## 5 1009 1009 AL Blount County 1
## 6 1011 1011 AL Bullock County 6
## Urban_Influence_2013 Metro_Nonmetro_2013 Metro_Micro_Noncore_2013
## 1 2 1 2
## 2 2 1 2
## 3 6 0 0
## 4 1 1 2
## 5 1 1 2
## 6 6 0 0
## oil2000 oil2001 oil2002 oil2003 oil2004 oil2005 oil2006 oil2007 oil2008
## 1 0 0 0 0 0 0 0 0 0
## 2 138,072 134,666 138,011 127,985 130,763 118,043 103,992 112,303 97,623
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## oil2009 oil2010 oil2011 gas2000 gas2001 gas2002 gas2003
## 1 0 0 0 0 0 0 0
## 2 84,982 101,955 94,638 72,543,902 98,699,994 107,142,655 101,510,068
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## gas2004 gas2005 gas2006 gas2007 gas2008 gas2009
## 1 0 0 0 0 0 0
## 2 90,146,850 84,536,875 83,951,640 82,876,786 78,547,145 68,525,628
## 3 0 0 0 0 0 0
## 4 0 8,301 98,853 480,015 684,143 551,719
## 5 0 0 0 0 20,516 61,054
## 6 0 0 0 0 0 0
## gas2010 gas2011 oil_change_group gas_change_group
## 1 0 0 Status Quo Status Quo
## 2 63,069,025 51,041,072 Status Quo H_Decline
## 3 0 0 Status Quo Status Quo
## 4 453,132 400,504 Status Quo Status Quo
## 5 3,594 21,496 Status Quo Status Quo
## 6 0 0 Status Quo Status Quo
## oil_gas_change_group
## 1 Status Quo
## 2 H_Decline
## 3 Status Quo
## 4 Status Quo
## 5 Status Quo
## 6 Status Quo
dim(data1)
## [1] 3109 35
As shown in the preview, the data set is relatively messy and there are several things that can be tidied up. For example, redundancies exist in the data set like FIPS and geoid—they are essentially representing the same thing; in addition, data are not organized by year, which makes it hard to manipulate the data of a specific year.
In order to prepare the data for visualization and analysis, I perform a series of steps to clean the data. First of all, I performed a common routine to tidy up the basics redundancies. The first step will be to getting rid of columns without values:
data2 <- data1[(colSums(is.na(data1)) != nrow(data1))]
head(data2)
## FIPS geoid Stabr County_Name Rural_Urban_Continuum_Code_2013
## 1 1001 1001 AL Autauga County 2
## 2 1003 1003 AL Baldwin County 3
## 3 1005 1005 AL Barbour County 6
## 4 1007 1007 AL Bibb County 1
## 5 1009 1009 AL Blount County 1
## 6 1011 1011 AL Bullock County 6
## Urban_Influence_2013 Metro_Nonmetro_2013 Metro_Micro_Noncore_2013
## 1 2 1 2
## 2 2 1 2
## 3 6 0 0
## 4 1 1 2
## 5 1 1 2
## 6 6 0 0
## oil2000 oil2001 oil2002 oil2003 oil2004 oil2005 oil2006 oil2007 oil2008
## 1 0 0 0 0 0 0 0 0 0
## 2 138,072 134,666 138,011 127,985 130,763 118,043 103,992 112,303 97,623
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## oil2009 oil2010 oil2011 gas2000 gas2001 gas2002 gas2003
## 1 0 0 0 0 0 0 0
## 2 84,982 101,955 94,638 72,543,902 98,699,994 107,142,655 101,510,068
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## gas2004 gas2005 gas2006 gas2007 gas2008 gas2009
## 1 0 0 0 0 0 0
## 2 90,146,850 84,536,875 83,951,640 82,876,786 78,547,145 68,525,628
## 3 0 0 0 0 0 0
## 4 0 8,301 98,853 480,015 684,143 551,719
## 5 0 0 0 0 20,516 61,054
## 6 0 0 0 0 0 0
## gas2010 gas2011 oil_change_group gas_change_group
## 1 0 0 Status Quo Status Quo
## 2 63,069,025 51,041,072 Status Quo H_Decline
## 3 0 0 Status Quo Status Quo
## 4 453,132 400,504 Status Quo Status Quo
## 5 3,594 21,496 Status Quo Status Quo
## 6 0 0 Status Quo Status Quo
## oil_gas_change_group
## 1 Status Quo
## 2 H_Decline
## 3 Status Quo
## 4 Status Quo
## 5 Status Quo
## 6 Status Quo
dim(data2)
## [1] 3109 35
From the preview and the dimension of the data before and after cleaning, one can see that there is no column without values.
Then, I delete the cols that only have one value:
data3<-Filter(function(x)(length(unique(x))>1), data2)
head(data3)
## FIPS geoid Stabr County_Name Rural_Urban_Continuum_Code_2013
## 1 1001 1001 AL Autauga County 2
## 2 1003 1003 AL Baldwin County 3
## 3 1005 1005 AL Barbour County 6
## 4 1007 1007 AL Bibb County 1
## 5 1009 1009 AL Blount County 1
## 6 1011 1011 AL Bullock County 6
## Urban_Influence_2013 Metro_Nonmetro_2013 Metro_Micro_Noncore_2013
## 1 2 1 2
## 2 2 1 2
## 3 6 0 0
## 4 1 1 2
## 5 1 1 2
## 6 6 0 0
## oil2000 oil2001 oil2002 oil2003 oil2004 oil2005 oil2006 oil2007 oil2008
## 1 0 0 0 0 0 0 0 0 0
## 2 138,072 134,666 138,011 127,985 130,763 118,043 103,992 112,303 97,623
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## oil2009 oil2010 oil2011 gas2000 gas2001 gas2002 gas2003
## 1 0 0 0 0 0 0 0
## 2 84,982 101,955 94,638 72,543,902 98,699,994 107,142,655 101,510,068
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## gas2004 gas2005 gas2006 gas2007 gas2008 gas2009
## 1 0 0 0 0 0 0
## 2 90,146,850 84,536,875 83,951,640 82,876,786 78,547,145 68,525,628
## 3 0 0 0 0 0 0
## 4 0 8,301 98,853 480,015 684,143 551,719
## 5 0 0 0 0 20,516 61,054
## 6 0 0 0 0 0 0
## gas2010 gas2011 oil_change_group gas_change_group
## 1 0 0 Status Quo Status Quo
## 2 63,069,025 51,041,072 Status Quo H_Decline
## 3 0 0 Status Quo Status Quo
## 4 453,132 400,504 Status Quo Status Quo
## 5 3,594 21,496 Status Quo Status Quo
## 6 0 0 Status Quo Status Quo
## oil_gas_change_group
## 1 Status Quo
## 2 H_Decline
## 3 Status Quo
## 4 Status Quo
## 5 Status Quo
## 6 Status Quo
dim(data3)
## [1] 3109 35
I also delete the cols that have the same value:
data4<-t(unique(t(data3)))
data4<-data.frame(data4)
dim(data4)
## [1] 3109 34
To better process the data, I get rid of comma style format(thousands seperator)
data5<-{
data4 %>%
mutate_each(funs(as.character(.)), oil2000:gas2011) %>%
mutate_each(funs(gsub(",", "", .)), oil2000:gas2011) %>%
mutate_each(funs(as.numeric(.)), oil2000:gas2011)}
head(data5)
## FIPS Stabr County_Name Rural_Urban_Continuum_Code_2013
## 1 1001 AL Autauga County 2
## 2 1003 AL Baldwin County 3
## 3 1005 AL Barbour County 6
## 4 1007 AL Bibb County 1
## 5 1009 AL Blount County 1
## 6 1011 AL Bullock County 6
## Urban_Influence_2013 Metro_Nonmetro_2013 Metro_Micro_Noncore_2013
## 1 2 1 2
## 2 2 1 2
## 3 6 0 0
## 4 1 1 2
## 5 1 1 2
## 6 6 0 0
## oil2000 oil2001 oil2002 oil2003 oil2004 oil2005 oil2006 oil2007 oil2008
## 1 0 0 0 0 0 0 0 0 0
## 2 138072 134666 138011 127985 130763 118043 103992 112303 97623
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## oil2009 oil2010 oil2011 gas2000 gas2001 gas2002 gas2003 gas2004
## 1 0 0 0 0 0 0 0 0
## 2 84982 101955 94638 72543902 98699994 107142655 101510068 90146850
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## gas2005 gas2006 gas2007 gas2008 gas2009 gas2010 gas2011
## 1 0 0 0 0 0 0 0
## 2 84536875 83951640 82876786 78547145 68525628 63069025 51041072
## 3 0 0 0 0 0 0 0
## 4 8301 98853 480015 684143 551719 453132 400504
## 5 0 0 0 20516 61054 3594 21496
## 6 0 0 0 0 0 0 0
## oil_change_group gas_change_group oil_gas_change_group
## 1 Status Quo Status Quo Status Quo
## 2 Status Quo H_Decline H_Decline
## 3 Status Quo Status Quo Status Quo
## 4 Status Quo Status Quo Status Quo
## 5 Status Quo Status Quo Status Quo
## 6 Status Quo Status Quo Status Quo
dim(data5)
## [1] 3109 34
After the basic cleanings, I continue to tidy up the data by assigning the data types and changing columns to factor as needed.
#factor
data5$Urban_Influence_2013<-as.factor(data5$Urban_Influence_2013)
data5$Rural_Urban_Continuum_Code_2013<-as.factor(data5$Rural_Urban_Continuum_Code_2013)
data5$oil_change_group <- as.factor(data5$oil_change_group)
data5$gas_change_group <- as.factor(data5$gas_change_group)
data5$oil_gas_change_group <- as.factor(data5$oil_gas_change_group)
data5$County_Name <- gsub("County", "", data5$County_Name)
To further clean the data, I clean the rows whose annual gross withdrawals of curde oil and natural gas equals to zero since I will be analyzing the regions with oil and gas outputs. I then also create a subset “value”, which only contains data of oil and gas withdrawals.
nums <- sapply(data5, is.numeric)
value <- data5[,nums]
x=NULL
for(i in 1:nrow(value)){
if(sum(value[i,])==0){x=c(x,i)}}
data6<-data5[-x,]
head(data6)
## FIPS Stabr County_Name Rural_Urban_Continuum_Code_2013
## 2 1003 AL Baldwin 3
## 4 1007 AL Bibb 1
## 5 1009 AL Blount 1
## 12 1023 AL Choctaw 9
## 13 1025 AL Clarke 7
## 18 1035 AL Conecuh 7
## Urban_Influence_2013 Metro_Nonmetro_2013 Metro_Micro_Noncore_2013
## 2 2 1 2
## 4 1 1 2
## 5 1 1 2
## 12 10 0 0
## 13 11 0 0
## 18 11 0 0
## oil2000 oil2001 oil2002 oil2003 oil2004 oil2005 oil2006 oil2007 oil2008
## 2 138072 134666 138011 127985 130763 118043 103992 112303 97623
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 12 652067 603845 564213 527582 463107 424154 363061 429449 442936
## 13 330933 319891 285497 301260 252415 235616 214651 201348 191449
## 18 110132 142044 96223 116122 423467 1207237 1687577 1821208 2531941
## oil2009 oil2010 oil2011 gas2000 gas2001 gas2002 gas2003 gas2004
## 2 84982 101955 94638 72543902 98699994 107142655 101510068 90146850
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 12 350345 375372 372512 283457 308097 351183 333110 330232
## 13 183050 169480 170611 100704 102275 85661 79814 88344
## 18 2485039 2352547 3294092 132402 139643 51886 105935 411087
## gas2005 gas2006 gas2007 gas2008 gas2009 gas2010 gas2011
## 2 84536875 83951640 82876786 78547145 68525628 63069025 51041072
## 4 8301 98853 480015 684143 551719 453132 400504
## 5 0 0 0 20516 61054 3594 21496
## 12 293620 256483 337575 677198 343412 362477 326952
## 13 85342 66184 61460 55108 56843 41448 43093
## 18 1055787 1488366 1612418 1950537 2597540 3040205 4078749
## oil_change_group gas_change_group oil_gas_change_group
## 2 Status Quo H_Decline H_Decline
## 4 Status Quo Status Quo Status Quo
## 5 Status Quo Status Quo Status Quo
## 12 Status Quo Status Quo Status Quo
## 13 Status Quo Status Quo Status Quo
## 18 H_Growth H_Growth H_Growth
dim(data6)
## [1] 1166 34
As mentioned before, oil and gas productions by year are recorded as individual columns such as “oil2001” and “gas2001”. I tidy it up by extracting the year and reorganzing the data so that oil and gas only have one column respectively.
#data frame (tidy data)
#oil= subset that does not contain gas data
#gas= subset that does not contain oil data
oil<- data.frame(subset(data6, select = -c(gas2000:gas2011)))
gas<- data.frame(subset(data6, select = -c(oil2000:oil2011)))
#reorganizing oil data by year
oil1 <- oil %>%
gather(Year, oil, oil2000:oil2011)%>%
mutate(Year = gsub("oil", "", Year))%>%
arrange(County_Name, Year)
#reorganzing gas data by year
gas1<- gas%>%
gather(Year, gas, gas2000:gas2011)%>%
mutate(Year = gsub("gas", "", Year))%>%
arrange(County_Name, Year)
#merge two data frame, get tidy data
data_clean<-merge(gas1,oil1)
The final tidy data looks like the following:
## FIPS Stabr County_Name Rural_Urban_Continuum_Code_2013
## 1 1003 AL Baldwin 3
## 2 1003 AL Baldwin 3
## 3 1003 AL Baldwin 3
## 4 1003 AL Baldwin 3
## 5 1003 AL Baldwin 3
## 6 1003 AL Baldwin 3
## Urban_Influence_2013 Metro_Nonmetro_2013 Metro_Micro_Noncore_2013
## 1 2 1 2
## 2 2 1 2
## 3 2 1 2
## 4 2 1 2
## 5 2 1 2
## 6 2 1 2
## oil_change_group gas_change_group oil_gas_change_group Year gas
## 1 Status Quo H_Decline H_Decline 2000 72543902
## 2 Status Quo H_Decline H_Decline 2001 98699994
## 3 Status Quo H_Decline H_Decline 2002 107142655
## 4 Status Quo H_Decline H_Decline 2003 101510068
## 5 Status Quo H_Decline H_Decline 2004 90146850
## 6 Status Quo H_Decline H_Decline 2005 84536875
## oil
## 1 138072
## 2 134666
## 3 138011
## 4 127985
## 5 130763
## 6 118043
With the tidy data, I could do some visualization to get the general idea of the oil and gas production distribution across the U.S. As a start, I make a Choropleth map of the overall oil and gas productions of the counties from 2001 to 2011.
I start first by combining the oil and gas productions of the 10-year period for each county:
oilvalue<-value[,1:(length(value)/2)]
gasvalue<-value[,(length(value)/2+1):length(value)]
oilsum<-data.frame(rowSums(oilvalue))
gassum<-data.frame(rowSums(gasvalue))
colnames(oilsum)[1]<-"value"
colnames(gassum)[1]<-"value"
df <- data1
aa<-data.frame(subset(df,select = c(1,3,4)))
colnames(aa)[1]<-'region'
oilsum<-cbind(aa,oilsum)
gassum<-cbind(aa,gassum)
I then plot the oil production distribution across the US:
oilcho = CountyChoropleth$new(oilsum)
oilcho$title = "Oil Production across US"
oilcho$ggplot_scale = scale_fill_brewer(name="Barrel Produced", palette=2, drop=FALSE)
oilcho$render()
Also the gas prodcution distribution:
gascho = CountyChoropleth$new(gassum)
gascho$title = "Gas Production across US"
gascho$ggplot_scale = scale_fill_brewer(name="Barrel Produced", palette=3, drop=FALSE)
gascho$render()
From the distribution maps, it can be concluded that majority of the oil and gas production comes from the middle part of the U.S.
I also plot the oil and gas production based on different area codes:
#Plot the oil and gas production based on the Metro, Micro, and Noncore statues
plot_ly(data = data_clean, x = ~Year , y = ~oil, color = ~Metro_Micro_Noncore_2013)
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
plot_ly(data = data_clean, x = ~Year , y = ~gas, color = ~Metro_Micro_Noncore_2013)
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
plot_ly(data = data_clean, x = ~Year , y = ~oil, color = ~Rural_Urban_Continuum_Code_2013)
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
plot_ly(data = data_clean, x = ~Year , y = ~gas, color = ~Rural_Urban_Continuum_Code_2013)
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
The mean and standard deviation can be better visualized in the box plots:
#Plot the oil and gas production based on Rural Urban Continuum Code 2013
#
plot_ly(data=data_clean, y = ~oil, color = ~Rural_Urban_Continuum_Code_2013, type = "box")
plot_ly(data=data_clean, y = ~gas, color = ~Rural_Urban_Continuum_Code_2013, type = "box")
plot_ly(data=data_clean, y = ~oil, color = ~Metro_Micro_Noncore_2013, type = "box")
plot_ly(data=data_clean, y = ~gas, color = ~Metro_Micro_Noncore_2013, type = "box")
After seeing the productions distribution among areas, I continue to visualize the data with time series scatter plot to see the production trend over the 12-year period.
#scatter plot (x=year, y=sum(value))
#oil hyperbola trend, gas increasing trend
op<-data.frame(colSums(oilvalue))
op$oil<-op$colSums.oilvalue.
op$Year<- c(2000:2011)
og<-data.frame(colSums(gasvalue))
og$gas<-og$colSums.gasvalue.
og$Year<- c(2000:2011)
plot_ly(op, x=~Year, y=~oil, type = "scatter")
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
plot_ly(og, x=~Year, y=~gas, type = "scatter")
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
Finally we can look at which state produces the most oil and gas
#which state produce the most?
plot_ly(data_clean, x=~ Stabr, y=~ oil)
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
plot_ly(data_clean, x=~ Stabr, y=~ gas)
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
For analysis, we can start by seeing the summary of tidy data of the oil and gas individually.
#summary
summary(data_clean$oil)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1606 72720 982400 507300 208800000
summary(data_clean$gas)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 5.291e+03 6.199e+05 1.549e+07 7.601e+06 1.198e+09
From the plots in the Data Visualization section, it can be seen that there are outliers in the data, making it hard for us to see the distribution among areas. Therefore, I use the following code to remove the outliers in the data set.
#remove outlier
#reference: https://www.r-bloggers.com/identify-describe-plot-and-remove-the-outliers-from-the-dataset/
outlierKD <- function(dt, var) {
var_name <- eval(substitute(var),eval(dt))
na1 <- sum(is.na(var_name))
m1 <- mean(var_name, na.rm = T)
par(mfrow=c(2, 2), oma=c(0,0,3,0))
boxplot(var_name, main="With outliers")
hist(var_name, main="With outliers", xlab=NA, ylab=NA)
outlier <- boxplot.stats(var_name)$out
mo <- mean(outlier)
var_name <- ifelse(var_name %in% outlier, NA, var_name)
boxplot(var_name, main="Without outliers")
hist(var_name, main="Without outliers", xlab=NA, ylab=NA)
title("Outlier Check", outer=TRUE)
na2 <- sum(is.na(var_name))
cat("Outliers identified:", na2 - na1)
m2 <- mean(var_name, na.rm = T)
dt[as.character(substitute(var))] <- invisible(var_name)
assign(as.character(as.list(match.call())$dt), dt, envir = .GlobalEnv)
return(invisible(dt))
}
no_outlier<-data_clean
no_outlier<-outlierKD(no_outlier, oil)
## Outliers identified: 1929
no_outlier<-outlierKD(no_outlier, gas)
## Outliers identified: 2203
no_outlier[no_outlier == 0 & is.numeric(no_outlier)] <- NA
I then use the data set without outlier to replot the plots in the previous section:
#Plot the oil production based on the Metro, Micro, and Noncore statues
plot_ly(data = data_clean, x = ~Year , y = ~oil, color = ~Metro_Micro_Noncore_2013)
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
plot_ly(data = no_outlier, x = ~Year , y = ~oil, color = ~Metro_Micro_Noncore_2013)
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
plot_ly(data = data_clean, x = ~Year , y = ~gas, color = ~Metro_Micro_Noncore_2013)
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
plot_ly(data = no_outlier, x = ~Year , y = ~gas, color = ~Metro_Micro_Noncore_2013)
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
#Plot the oil and gas production based on Rural Urban Continuum Code 2013
#
plot_ly(data=no_outlier, y = ~oil, color = ~Rural_Urban_Continuum_Code_2013, type = "box")
plot_ly(data=no_outlier, y = ~gas, color = ~Rural_Urban_Continuum_Code_2013, type = "box")
plot_ly(data=no_outlier, y = ~oil, color = ~Metro_Micro_Noncore_2013, type = "box")
plot_ly(data=no_outlier, y = ~gas, color = ~Metro_Micro_Noncore_2013, type = "box")